
Creating Synthetic Data in R

To evaluate new methods and to diagnose problems with modeling processes, we often need to generate synthetic data. This allows us to precisely control the data going into our modeling methods and then check the output to see if it is as expected. In this lab, you'll use R to create point and raster data sets for use in trend surface and interpolation analysis.

1. Creating Random Grid Data

First, let's create a single array with some random data in R:

# Create a one dimensional array for our independent variable

NumEntries=100
X=1:NumEntries # create an array with 1,2,3,...

# Create an array with random values as our dependent data

RandomMean=0
RandomStdDev=1

RandomValues=rnorm(NumEntries, mean = RandomMean, sd = RandomStdDev)
Y=RandomValues # normally distributed values centered on RandomMean

plot(Y)
   

When you run the code above, you should see a plot of random values centered on 0, with most values falling within three standard deviations of the mean (between -3 and 3).

The rnorm() function does not create truly random numbers, because computers are deterministic machines; the values are pseudorandom. However, for our purposes, these numbers will be just fine.
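Because the numbers are pseudorandom, you can also make a run repeatable. A minimal sketch using base R's set.seed():

# setting the seed makes the "random" sequence reproducible
set.seed(42) # any integer works; the same seed gives the same sequence
print(rnorm(5))
set.seed(42)
print(rnorm(5)) # identical to the five values above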

The code above uses the "rnorm()" function, which creates random values from a normal distribution. The normal distribution is the most commonly used, but there are other functions in R that create random values from other distributions.
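For example, two of the other generators built into base R:

# uniformly distributed values between -1 and 1
U=runif(NumEntries, min=-1, max=1)

# Poisson-distributed counts with a mean (lambda) of 4
P=rpois(NumEntries, lambda=4)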

Now try different values for the mean and standard deviation.

Question 1: What effect do the mean and standard deviation have on the data?

2. Create a Trend

A trend is a systematic relationship in the data driven by some phenomenon that we can measure. There is a large area of modeling that uses polynomial expressions to model phenomena. As a review of polynomials, remember that the equation for a line is:

y=mx+b

Where m is the slope of the line and b is the intercept. In regression, b represents the value of y when the covariate x is 0. The m is then the strength of the relationship between x and y: if m is small, y changes little as x changes; if m is large, y changes a lot as x changes. Since the exponent on x is one, this is referred to as a "first order" polynomial.

Note: When we fit a model to data, m and b are the "parameters", also called the "coefficients", of the model.

In statistics, we replace m and b with B0 (the intercept) and B1 (the slope). This notation extends naturally to higher-order functions; for example, a second-order polynomial is written y=B0+B1*x+B2*x^2.

Add the code below to create a trend and plot it.

# Setup the parameters for the trend

B0=1 # intercept
B1=1 # slope

Y=B0+B1*X

plot(Y)
abline(B0,B1) # draw the line defined by the intercept and slope over the data

Question 2: What effect does setting B1 to 10 have? What effect does setting B1 to -1 have?

Question 3: What effect does changing B0 have?

Try other values until you are comfortable creating linear data in R.

3. Combining a Trend with Random Data

Add the code below to add a trend to the data and plot the result.

# Combine the trend and random data

Y=B0+B1*X + RandomValues

plot(Y)

Question 4: What effect does increasing and decreasing the value of the standard deviation in the random function have?

Try different values for each of the coefficients until you are comfortable with the impact that random effects and linear trends have on data.

4. Evaluating a Trend Surface

Remember the "lm()" function from last week's lab? The format for this function is:

# Create the model
RegressionModel=lm(Y~X)

# output the coefficients of the model
print(RegressionModel)

# output additional information on the model
summary(RegressionModel)

Where Y is the response variable and X is the covariate. You can also add additional covariates. See my "R" web site for how to interpret the outputs from "print(...)" and "summary(...)".
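For example, with a second covariate (a hypothetical X2 measured alongside X), the formula simply gains a term:

# two-covariate model; X2 is a hypothetical second covariate
RegressionModel2=lm(Y ~ X + X2) # fits Y = B0 + B1*X + B2*X2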

Try different models, then plot and print them to see if R can recover your original coefficients. Now increase the number of values in your data set. Also, increase and reduce the magnitude of your random component and examine how the quality of the fitted models changes as more random noise is added.
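One quick way to make that comparison, using the model fit above:

# the fitted intercept and slope; compare these to your B0 and B1
coef(RegressionModel)

# overlay the fitted line on the data
plot(X,Y)
abline(RegressionModel)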

Note: Running lm() is the equivalent of running the "Trend" tool in ArcGIS. You'll find that the tools in ArcGIS tend to be easier to use while the tools in R have more flexibility.

Question 5: How well does R find the original coefficients of your polynomials?

Removing the trend

Now we can remove the trend from our data by simply subtracting the model's predictions from our data. To create a prediction from our model, we need to supply the covariate values in a data frame whose column name matches the covariate in the model. Then we can subtract the predictions from the original data to find the residuals and plot them as a histogram.

# create a new data frame of x values for the covariate; the column must be
# named "X" to match the name used in the model formula
X1=data.frame(X=1:100)

# create a prediction for Y based on our new X values and the fitted model
Y1=predict.lm(RegressionModel,newdata=X1) 
plot(Y1)

# subtract the predicted values from the original data to find the residuals
Residuals=Y-Y1

# create a histogram and a qqplot of the residuals
hist(Residuals)
qqnorm(Residuals)

Create histograms for the original response values (Y), your predicted trend surface, and your residuals.
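A compact way to see all three at once is base R's par() function, which can place the plots side by side:

par(mfrow=c(1,3)) # three plots in one row
hist(Y, main="Original data")
hist(Y1, main="Predicted trend")
hist(Residuals, main="Residuals")
par(mfrow=c(1,1)) # restore the default single-plot layout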

Question 6: How good a job did the prediction do at removing the trend in your data?

5. Getting Control of Polynomials (Optional)

Add additional coefficients to the model to create higher-order functions. Adding a squared term makes the function "quadratic", adding a cubed term makes it "cubic", and so on. This is referred to as raising the "degree" of the polynomial.

# Combine the higher-order trend and the random data

B2=0.05 # quadratic coefficient (pick your own values)
B3=0.001 # cubic coefficient

Y=B0+B1*X+B2*X^2+B3*X^3 + RandomValues

plot(Y)

Question 7: What effect does increasing and decreasing the values of B2 and B3 have? Remember to try negative numbers.

You may find that it is challenging to get anything other than a straight line or a single exponential-looking curve. To see something more interesting, you'll need to think about what is happening with each piece of the equation. The higher-order terms produce much larger values, so you'll need to increase the lower-order coefficients for them to have a visible effect; try making each lower-order coefficient 10 times as large as the next-highest-order coefficient. One concrete example follows.
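One trick (an illustration, not the only approach) is to pick the roots of the polynomial first and multiply them out. A cubic with roots at X=25, 50, and 75 produces a visible S-shaped curve over X=1:100:

# coefficients from expanding (X-25)*(X-50)*(X-75)
B0=-93750
B1=6875
B2=-150
B3=1

Y=B0+B1*X+B2*X^2+B3*X^3 + RandomValues
plot(Y) # at this scale, noise with a standard deviation of 1 is invisible; raise RandomStdDev to see it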

The most important lesson here is how challenging it is to have polynomials represent complex phenomena. Polynomials have their place, but they are difficult to work with and typically do not respond in the way that natural spatial phenomena do. Over the next weeks, we'll be learning other techniques that use different mathematics to create spatial models.

Note that you can add additional covariates to a polynomial very easily. The general form of a multivariate linear (first-order) equation is:

y=B0+B1*x1+B2*x2+B3*x3...

Where B0 is the intercept and B1, B2, and B3 are the slope values ("m" from above) that determine how y responds to each x value.

The "lm()" function we have been using is named for "linear model" but it can actually create models for multidimensional, higher-order, polynomials. Update your model for the additional coefficients and see how well lm() performs.

6. Creating Data with Autocorrelation

Another phenomenon in the real world is that things that are closer together tend to be more alike. This can be because of a trend driven by another phenomenon, or because of processes such as trees and other species spreading their seeds nearby more than far away. After we remove any trends, we want to know whether there is any autocorrelation left in the data.

Autocorrelation is often a trend that has yet to be discovered.

The gradient data set from above is highly autocorrelated, but it is also an easy trend to detect. Trigonometric functions (sine and cosine) can be used to create patterns of values that change spatially over a grid. Below is a method for adding some fake autocorrelated data.

# autocorrelated component

Frequency=0.1
Magnitude=20

Y=B0+B1*X + RandomValues + sin(X*Frequency)*Magnitude

plot(Y)

Change the frequency and magnitude of the autocorrelation to see its effect on the data.

Below is R code that computes a Moran's I statistic for a linear array.

# simplified Moran's I (one dimension)

TheMean=mean(Y) # find the overall mean of the array

# find the sum of squares 

SumOfSquares=0
for (i in 1:NumEntries ) 
{
  SumOfSquares=SumOfSquares+(Y[i]-TheMean)^2
}
print(SumOfSquares)

# find the sum of the weighted differences
TheWeights=0
TheSum=0
for (i in 1:NumEntries ) 
{
  for (j in 1:NumEntries ) 
  {
    if (i!=j) # don't compute differences between the same values
    {
      TheWeight=1/((i-j)^2) # compute a weight based on position in the array (i-j)
      TheWeights=TheWeights+TheWeight # add the weights to an overall sum of weights
      
      # the key equation, where we add up the weighted differences from the mean
      TheSum=TheSum+TheWeight*(Y[i]-TheMean)*(Y[j]-TheMean)
    }
  }
}
# normalize the sum; Moran's I typically ranges from -1 (dispersed) to 1 (clustered)
MoransI=(NumEntries/TheWeights)*(TheSum/SumOfSquares)
print(MoransI) 

Question 8: What is the value of Moran's I?
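As a cross-check (a suggestion beyond the lab code above), base R's acf() function plots the correlation of a one-dimensional series with lagged copies of itself; spikes extending past the dashed confidence band indicate autocorrelation:

acf(Y) # autocorrelation at increasing lags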

7. Putting It All Together

Your task is to write R code that generates a data set containing some autocorrelation that is not part of a trend surface, plus some random noise. Then use a model to evaluate the data set and remove the trend. Finally, find the residuals and determine how much autocorrelation remains in the data.
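Here is one possible sketch to start from, reusing names and values from the earlier sections:

# generate data with a linear trend, an autocorrelated component, and noise
X=1:NumEntries
Y=B0+B1*X + sin(X*Frequency)*Magnitude + rnorm(NumEntries, RandomMean, RandomStdDev)

# fit a linear model and remove the fitted trend
TheModel=lm(Y~X)
Residuals=Y-predict(TheModel) # predict() without newdata returns the fitted values

# examine what remains
plot(Residuals)
hist(Residuals)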

To remove the autocorrelation, we would need to use a semi-variogram to determine the amount of autocorrelation and then create a kriged surface, which we would subtract from our data. We do not have a tool to perform this on one-dimensional data, so we'll wait to tackle that.

Repeat this process over and over, trying different values for the amount of random noise, autocorrelation, and trend surface in the "model". Get a feel for how these change the data and the effect they have on your ability to pull out the trend surface.

You'll want to save this code for when you need similar code in the future. Document the script with a file header block and appropriate in-line comments.

Note that the process you have just completed is very similar to how we need to treat most data we are modeling. You have point data with measured values, and you are trying to separate trends, autocorrelation, and random noise. The trends typically come from covariates, while autocorrelation represents interactions between the topic of interest and itself (and possibly other things). The world is a noisy place, so we have to expect a good bit of noise to show up in any data set.

8. Creating Two-Dimensional Data (Optional)

Working with two-dimensional arrays in R is very similar to working with one-dimensional arrays, except that you initialize them differently and use two indexes (values in brackets) to access the values.

# parameters

Width=10
Height=20

RandomMean=0
RandomStdDev=1

# set up the array; initializing it with 1:NumEntries fills it with a gradient

NumEntries=Width*Height
TheArray=array(1:NumEntries,dim=c(Height,Width))

# random component

for (i in 1:Height ) {
  for (j in 1:Width) {
    RandomValue=rnorm(1, mean = RandomMean, sd = RandomStdDev)
    TheArray[i,j]=TheArray[i,j]+RandomValue # add normally distributed noise to the gradient
  }
}
# plot the result in 3D

persp(TheArray,theta=60,phi=30) 
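If the perspective plot is hard to read from this angle, base R's image() function gives a flat, color-coded view of the same grid:

image(TheArray) # heat-map style view; both the gradient and the noise should be visible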

 
